Automatically Determining a Proper Length for Multi-Document Summarization: A Bayesian Nonparametric Approach
Abstract
Document summarization is an important task in natural language processing that aims to extract the most important information from a single document or a cluster of documents. In various summarization tasks, the summary length is set manually. However, choosing a proper summary length is difficult, and restricting all summaries to the same length is not always a good choice: it is clearly inappropriate to generate summaries of the same length for two document clusters that contain very different amounts of information. In this paper, we propose a Bayesian nonparametric model for multi-document summarization that automatically determines the proper length of each summary. Assuming that an original document can be reconstructed from its summary, we describe this "reconstruction" with a Bayesian framework that selects sentences to form a good summary. Experimental results on the DUC2004 data sets and some expanded data demonstrate the good quality of our summaries and the reasonableness of the length determination.
Similar resources
A survey on Automatic Text Summarization
Text summarization endeavors to produce a summary version of a text while maintaining the original ideas. The textual content on the web, in particular, is growing at an exponential rate. The ability to sift through such a massive amount of data, in order to extract the useful information, is a major undertaking and requires an automatic mechanism to aid with the extant repository of informa...
Using Outcome Polarity in Sentence Extraction for Medical Question-Answering
Multiple pieces of text describing various pieces of evidence in clinical trials are often needed in answering a clinical question. We explore a multi-document summarization approach to automatically find this information for questions about effects of using a medication to treat a disease. Sentences in relevant documents are ranked according to various features by a machine learning approach. ...
A Novel Feature-based Bayesian Model for Query Focused Multi-document Summarization
Supervised learning methods and LDA-based topic models have been successfully applied in the field of multi-document summarization. In this paper, we propose a novel supervised approach that can incorporate rich sentence features into Bayesian topic models in a principled way, thus taking advantage of both topic models and feature-based supervised learning methods. Experimental results on DUC200...
Bringing Summarization to End Users: Semantic Assistants for Integrating NLP Web Services and Desktop Clients
We present PathSum, a high-performing hierarchical-topic based single- and multi-document automatic text summarization framework. This approach leverages Bayesian nonparametric methods to model sentences as paths through a tree and create a hierarchy of topics from the input in an unsupervised setting. We describe the generative model used to learn a topic tree based on hierarchical latent Dirich...
Multi-Document Summarization using Sentence-based Topic Models
Most of the existing multi-document summarization methods decompose the documents into sentences and work directly in the sentence space using a term-sentence matrix. However, the knowledge on the document side, i.e. the topics embedded in the documents, can help the context understanding and guide the sentence selection in the summarization procedure. In this paper, we propose a new Bayesian s...